Introduction

There is a constant struggle to ensure that people receive the amount of aid they need to survive. Some programs target the poorest populations to ensure they are properly cared for. Unfortunately, these poorer communities are often unable to accurately document that they qualify for the amount of aid they need. The goal of this project is to take observable attributes of a given household and classify the household into one of several poverty levels.

The data provided includes 142 predictor variables, with a decent spread between categorical, numeric, and binary values. Our response, Target, is a categorical variable with 4 levels, and each row represents an observed individual. Below is a list of variables with some omitted or modified for clarity:

| Variable | Type | Description |
|----------|------|-------------|
| v2a1 | Numeric | Monthly rent payment |
| hacdor | Numeric | Overcrowding by bedrooms |
| rooms | Numeric | Number of rooms in house |
| hacapo | Numeric | Overcrowding by all rooms |
| v14a | Binary | Has bathroom in household |
| refrig | Binary | Has refrigerator in household |
| v18q1 | Numeric | Number of tablets household owns |
| r4t1 | Numeric | Persons younger than 12 years |
| r4t2 | Numeric | Persons older than 12 years |
| escolari | Numeric | Years of schooling |
| rez_esc | Numeric | Years behind in school |
| hhsize | Numeric | Household size |
| pared | Categorical | Wall material |
| piso | Categorical | Floor material |
| techo | Categorical | Roof material |
| cielorazo | Binary | Presence of ceiling in home |
| abastagua | Categorical | Home water source |
| elec | Categorical | Home electricity source |
| sanitario | Categorical | Home plumbing type |
| energcocinar | Categorical | Home kitchen type |
| elimbasu | Categorical | Home waste disposal type |
| epared | Numeric | Wall quality |
| etecho | Numeric | Roof quality |
| eviv | Numeric | Floor quality |
| dis | Binary | Individual is disabled |
| gender | Binary | Individual gender |
| estadocivil | Categorical | Individual civil status |
| parentesco | Categorical | Relation to head of household |
| hogar_nin | Numeric | Individuals under 19 |
| hogar_adul | Numeric | Individuals between 19 and 65 |
| hogar_mayor | Numeric | Individuals over 65 |
| dependency | Numeric | Ratio of dependents to independents |
| edjefe | Numeric | Education of head of household (male) |
| edjefa | Numeric | Education of head of household (female) |
| meaneduc | Numeric | Mean years of education in household |
| instlevel | Categorical | Highest form of education achieved |
| bedrooms | Numeric | Number of bedrooms |
| tipovivi | Categorical | House status (rent, own, etc.) |
| computer | Binary | Presence of household computer |
| television | Binary | Presence of household TV |
| lugar | Categorical | Region |
| area | Binary | Urban/Rural |
| age | Numeric | Individual age |
| Target | Categorical | Household poverty level |

Furthermore, our response variable Target has the following categories which we will attempt to classify households into:

| Level | Description | Count |
|-------|-------------|-------|
| 1 | Extreme poverty | 211 |
| 2 | Moderate poverty | 420 |
| 3 | Vulnerable | 339 |
| 4 | Non-vulnerable | 1849 |

As shown by the counts, the data is rather imbalanced across the levels of vulnerability; this imbalance will be accounted for in later modeling steps.

Data Cleaning and Exploration

There are several variables (SQBescolari, SQBage, SQBhogar_total, SQBedjefe, SQBhogar_nin, SQBovercrowding, SQBdependency, SQBmeaned, and agesq) that we deemed irrelevant to our analysis. These variables are squares of other existing variables, so they were removed from the dataset to avoid collinearity with their non-squared counterparts. Additional data cleaning steps addressed missing values, standardized data formats, and computed new variables to ensure the dataset was ready for analysis. Missing values were filled with logical defaults or calculated averages: for example, missing v2a1 (rent payment) and v18q1 (tablets owned) were replaced with zero, and the mean education level for specific age groups was used to impute meaneduc. Categorical variables, like edjefa and edjefe, were converted into binary numeric formats. Dependency ratios were recalculated at the household level using age group distributions, and a binary indicator for school attendance was created.
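A minimal sketch of these imputation and recoding steps in Python/pandas is shown below. The household identifier column (`idhogar` here) is an assumption, as is the exact mixed yes/no-and-number encoding of edjefe and edjefa; treat this as an illustration of the cleaning logic rather than the exact code used.

```python
import pandas as pd
import numpy as np

def clean_household_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the missing-value and encoding fixes described above."""
    df = df.copy()
    # Households with no recorded rent or tablets: treat as zero.
    df["v2a1"] = df["v2a1"].fillna(0)
    df["v18q1"] = df["v18q1"].fillna(0)
    # edjefe/edjefa mix "yes"/"no" strings with numbers; recode to numeric.
    yes_no = {"yes": 1, "no": 0}
    for col in ("edjefe", "edjefa"):
        df[col] = pd.to_numeric(df[col].replace(yes_no), errors="coerce")
    # Impute meaneduc from the mean schooling of adults in the same household
    # (idhogar is an assumed household-id column).
    adult_mean = df[df["age"] >= 18].groupby("idhogar")["escolari"].mean()
    missing = df["meaneduc"].isna()
    df.loc[missing, "meaneduc"] = df.loc[missing, "idhogar"].map(adult_mean)
    return df
```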

Once these steps had been executed, new variables were constructed to address the ambiguity of some of the existing features. Some of these variables create household-level summaries, such as counts of individuals in school, children behind in their schooling, and disabled household members, along with an indicator for households containing non-family members. A filter was then applied to the dataset to include only heads of households, and redundant or constant features were removed. Housing quality was quantified through composite scores for interior, exterior, and overall quality based on materials and utilities. To visualize household dependencies, a simple scatterplot was created.
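The household-level summaries and the composite quality score can be sketched as follows. The household-id column `idhogar` and the head-of-household flag `parentesco1` are assumed names, and the quality score here is simply the sum of the three 1-3 quality ratings; the actual composite may have been weighted differently.

```python
import pandas as pd

def add_household_features(df: pd.DataFrame) -> pd.DataFrame:
    """Household-level summaries, then one row per head of household."""
    df = df.copy()
    grouped = df.groupby("idhogar")
    # Count members behind in school and disabled members per household.
    df["n_behind_school"] = grouped["rez_esc"].transform(
        lambda s: (s.fillna(0) > 0).sum()
    )
    df["n_disabled"] = grouped["dis"].transform("sum")
    # Composite house-quality score from the wall/roof/floor quality ratings.
    df["house_quality"] = df[["epared", "etecho", "eviv"]].sum(axis=1)
    # Keep only heads of households (parentesco1 == 1 is an assumed encoding).
    return df[df["parentesco1"] == 1]
```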

As we can see from this plot, as the dependency ratio increases, house quality shows a slightly larger spread, which could indicate that having more dependents has an impact on the quality of a household. Another argument could be made that people with more dependents appear to be in a position to provide for them, since house quality scores trend slightly upward as the dependency ratio increases. Additionally, there could be a relationship between education level and house quality. To view this, another scatterplot was made.

As we can see in this plot, the house quality appears to have a positive relationship with average education level. There is also a smaller dependency ratio as both house quality and education level rise.

The Statistical Model

Base Model

Our first step is to establish a baseline model. Since our response is categorical with 4 levels, we use a multinomial logistic regression model. We separated our data with a 70/30 train/test split and fit a model using all variables created and retained in the data cleaning process. The performance of this model is shown below:

| Base Model | 1 | 2 | 3 | 4 |
|------------|---|---|---|---|
| Sensitivity | 0.22 | 0.31 | 0.05 | 0.91 |
| Specificity | 0.96 | 0.90 | 0.97 | 0.46 |

The initial model correctly classifies true non-vulnerable households quite well, as shown by the high sensitivity for class 4. However, the low specificity for class 4 suggests the model is over-classifying households into the non-vulnerable category. In the other classes, the model identifies true negatives very often, as shown by the high specificities for classes 1, 2, and 3, but the low sensitivities there tell the same story: the model defaults to labeling households as non-vulnerable to poverty.
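The per-class sensitivity and specificity values reported throughout can be computed in a one-vs-rest fashion from the confusion matrix. A sketch in Python with scikit-learn (the `X`/`y` names in the commented usage are assumptions for the prepared design matrix and Target):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_sens_spec(y_true, y_pred, labels):
    """One-vs-rest sensitivity and specificity for each class,
    matching the layout of the tables in this report."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    sens, spec = {}, {}
    for i, lab in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp   # true class i, predicted elsewhere
        fp = cm[:, i].sum() - tp   # predicted i, actually another class
        tn = cm.sum() - tp - fn - fp
        sens[lab] = tp / (tp + fn)
        spec[lab] = tn / (tn + fp)
    return sens, spec

# Hypothetical usage with the baseline multinomial model:
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import train_test_split
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, stratify=y)
# model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
# sens, spec = per_class_sens_spec(y_te, model.predict(X_te), labels=[1, 2, 3, 4])
```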

To improve our model, we first attempted forward, backward, and step-wise model selection to see if simply removing some of the predictors would help. While each of these techniques performed similarly to the full model, they also shared its problem of low sensitivity.

| Backward Model | 1 | 2 | 3 | 4 |
|----------------|---|---|---|---|
| Sensitivity | 0.22 | 0.33 | 0.06 | 0.91 |
| Specificity | 0.97 | 0.91 | 0.97 | 0.44 |

| Forward/Step-wise Model | 1 | 2 | 3 | 4 |
|-------------------------|---|---|---|---|
| Sensitivity | 0.17 | 0.36 | 0.00 | 0.94 |
| Specificity | 0.97 | 0.90 | 0.99 | 0.40 |
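A rough Python analogue of the forward selection step uses scikit-learn's greedy, cross-validation-scored `SequentialFeatureSelector`; note this is an approximation of classical step-wise selection (which is typically AIC-based), not an exact reproduction:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

def select_features(X, y, n_features=3):
    """Greedy forward selection scored by cross-validated accuracy;
    approximates step-wise selection for the multinomial model."""
    base = LogisticRegression(max_iter=2000)
    selector = SequentialFeatureSelector(
        base, direction="forward", n_features_to_select=n_features, cv=3
    )
    selector.fit(X, y)
    return selector.get_support(indices=True)
```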

Next, elastic net with 10-fold cross-validation was employed. We initially attempted ridge regression, but noticed that the model was only predicting cases from levels 2 and 4, the two largest groups within the data. We then tried LASSO regression and had nearly identical results. Finally, in order to rule out elastic net as a viable strategy for our data, we tested \(\alpha\) values between 0 and 1 in intervals of 0.05. Out of all of these tests, \(\alpha=0.3\) was able to make predictions in levels 1, 2, and 4, \(\alpha=0.7\) was able to make predictions in all 4 classes, and all others only predicted classes 2 and 4. However, neither \(\alpha=0.3\) nor \(\alpha=0.7\) made accurate predictions, so none of the elastic net models will be used.
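If reproducing this step in Python, scikit-learn's `LogisticRegressionCV` supports elastic-net penalties, with `l1_ratio` playing the role of \(\alpha\) (0 = ridge, 1 = LASSO). A minimal sketch, assuming prepared training data `X_tr`, `y_tr`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def elastic_net_grid(X, y, l1_ratios, cv=10, Cs=5):
    """Multinomial logistic regression with an elastic-net penalty,
    tuned by cross-validation over the given l1_ratio (alpha) grid."""
    model = LogisticRegressionCV(
        penalty="elasticnet",
        solver="saga",              # the only solver supporting elastic net
        l1_ratios=list(l1_ratios),
        Cs=Cs,                      # grid of regularization strengths
        cv=cv,
        max_iter=2000,
    )
    return model.fit(X, y)

# Hypothetical usage over the alpha grid described above:
# fit = elastic_net_grid(X_tr, y_tr, np.arange(0, 1.05, 0.05))
# print(set(fit.predict(X_te)))  # which classes does it ever predict?
```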

We hypothesized that, since the predictors did not appear to be the limiting factor, the class imbalance must be the reason for the poor predictions. To address this issue we employed two different strategies. The first was randomized under-sampling of the training data: we randomly selected 150 observations from each of the 4 classes and trained our model on only those values. All previously mentioned procedures were conducted on the balanced model as well. The balanced model and its various derivatives produced a better balance of sensitivity values, though they were still relatively low, as shown for the balanced full model below:

| Balanced Model | 1 | 2 | 3 | 4 |
|----------------|---|---|---|---|
| Sensitivity | 0.48 | 0.31 | 0.30 | 0.68 |
| Specificity | 0.85 | 0.85 | 0.83 | 0.83 |
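The under-sampling step itself is simple to sketch with pandas, drawing 150 rows per class as described (the `Target` column name follows the data dictionary; the seed is arbitrary):

```python
import pandas as pd

def undersample(train: pd.DataFrame, target_col: str = "Target",
                n_per_class: int = 150, seed: int = 42) -> pd.DataFrame:
    """Randomly draw n_per_class rows from each class so the
    training data has equal class counts."""
    return (
        train.groupby(target_col, group_keys=False)
        .sample(n=n_per_class, random_state=seed)
        .reset_index(drop=True)
    )
```

Note that sampling down to 150 per class discards most of the non-vulnerable observations, which is one motivation for also trying probability weights.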

The second strategy was testing a few different weights on each of the class probability predictions. The weights, along with the performance of their corresponding models, are shown below:

| Weight | \(\kappa\) | AUC |
|--------|-----------|-----|
| None | 0.26 | 0.65 |
| \(\frac{1}{\text{prop}}\) | 0.27 | 0.70 |
| \(\frac{1}{\sqrt{\text{prop}}}\) | 0.31 | 0.68 |

As shown, both of these models have larger \(\kappa\) values and ROC-AUC scores, which suggests they perform better than the unweighted model. The trade-off, however, is that neither of these new models is nearly as accurate at predicting class 4 as the unweighted version, as shown below.

| Inverse Proportion Model | 1 | 2 | 3 | 4 |
|--------------------------|---|---|---|---|
| Sensitivity | 0.50 | 0.30 | 0.32 | 0.66 |
| Specificity | 0.88 | 0.87 | 0.82 | 0.83 |

| Inverse Square Root Proportion Model | 1 | 2 | 3 | 4 |
|--------------------------------------|---|---|---|---|
| Sensitivity | 0.37 | 0.35 | 0.16 | 0.83 |
| Specificity | 0.93 | 0.88 | 0.92 | 0.65 |
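The probability-weighting strategy can be sketched as follows: per-class weights of \(1/\text{prop}\) or \(1/\sqrt{\text{prop}}\) are computed from the training proportions and used to rescale each class's predicted probability before taking the argmax. This is an illustration of the scheme, not the original code; `model` is any fitted classifier exposing `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def prob_weights(y_train, scheme="inverse"):
    """Per-class weights from training proportions: 1/prop or 1/sqrt(prop)."""
    classes, counts = np.unique(y_train, return_counts=True)
    prop = counts / counts.sum()
    w = 1.0 / prop if scheme == "inverse" else 1.0 / np.sqrt(prop)
    return dict(zip(classes, w))

def weighted_predict(model, X, weights):
    """Rescale predicted class probabilities by the weights, then argmax."""
    proba = model.predict_proba(X)
    scale = np.array([weights[c] for c in model.classes_])
    return model.classes_[np.argmax(proba * scale, axis=1)]
```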

Multi-Model Approach

Since our weighted model did so much better at predicting classes 1-3, we will continue to use it. However, to improve on its weaker performance for class 4, we will introduce a second model which exclusively predicts class 4.

For this model, the Target variable has been transformed into a binary indicator of whether the case falls into class 4 or not. Here we want as high a specificity as we can reasonably get, since false positives exit the pipeline at this stage and are never passed on to the multinomial model. Since recall (the number of true positives divided by the total number of actual positives) appears to scale linearly with the threshold, we took the highest threshold with acceptable \(\kappa\) and AUC scores. We chose 0.75 for its combination of high values in all three metrics. The results are shown in the plot below:

Now that we have a threshold selected for the binary model, all we need to do is pass the values that it predicts are not 4 to the original model. The results are shown in the following table:

| Final Model | 1 | 2 | 3 | 4 |
|-------------|---|---|---|---|
| Sensitivity | 0.50 | 0.30 | 0.30 | 0.68 |
| Specificity | 0.88 | 0.87 | 0.84 | 0.80 |
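The full two-stage prediction described above can be sketched as follows. Here `binary_model` is assumed to have been fit on an indicator coding class 4 as 1, `multi_model` is the weighted multinomial model, and `weights` is the inverse-proportion weight dictionary; all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_stage_predict(binary_model, multi_model, X, weights, threshold=0.75):
    """Stage 1: flag class 4 wherever P(class 4) >= threshold.
    Stage 2: send the remaining rows to the weighted multinomial model."""
    pos_col = list(binary_model.classes_).index(1)
    p4 = binary_model.predict_proba(X)[:, pos_col]
    preds = np.empty(len(X), dtype=int)
    is_four = p4 >= threshold
    preds[is_four] = 4
    if (~is_four).any():
        rest = X[~is_four]
        proba = multi_model.predict_proba(rest)
        scale = np.array([weights[c] for c in multi_model.classes_])
        preds[~is_four] = multi_model.classes_[np.argmax(proba * scale, axis=1)]
    return preds
```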

While the sensitivity for class 4 fell by a decent margin, its specificity rose to compensate. Metrics for class 2 are more or less the same, while metrics for classes 1 and 3 have seen substantial improvements. This model places a greater emphasis on correctly identifying those vulnerable to poverty, in contrast with our initial model, which over-emphasized correctly identifying those not vulnerable to poverty. We believe the new model will suggest a better allocation of resources across all poverty vulnerability classes than the initial model would have.

Results Summary

In the context of this data and the problem we are trying to address, we believe the true positive rate should be optimized. A greater true positive rate will help to correctly distribute resources to vulnerable populations, as those in worse tiers of poverty may need more support than those who are less vulnerable. Overall, we determined that the imbalance in our data makes it hard for the models to classify true positive cases. This can be seen in the extremely low sensitivity and overall classification rates for classes 1 and 3, which are less prevalent in the data.

We tried two techniques to optimize the true positive rate: probability weights and under-sampling to balance the data. Both techniques led to higher sensitivity values for classes 1 and 3, as intended. Ultimately, we settled on the inverse proportion probability weights and a multi-model approach, identifying non-vulnerable households first and then classifying the other three classes afterwards. This was done to ensure the model accounted for all of the data in the non-vulnerable population, which would not be the case under the under-sampling technique. The multi-model approach resulted in greater sensitivity for the three vulnerable classes while not reducing the non-vulnerable class's sensitivity too much. Overall, we believe our final model minimizes the limitations of all techniques employed. The most notable benefits of our approach are the utilization of as much data as possible and class weighting to alleviate over-fitting on the majority class. Because of this, we believe our final model is the best of the models constructed for accurately classifying poverty vulnerability in a way that should enable a more equitable distribution of resources.